A Tidy Framework and Infrastructure to Systematically Assemble Spatio-temporal Indexes from Multivariate Data

H. Sherry Zhang

JSM 2024 Portland, Oregon

Aug 7, 2024

Indexes

Sport climbing in the Olympic Games 🧗‍♀️

2020 Tokyo version

Boulder: 4m wall, 3 problems in final

Lead: 15m wall, 1 problem

Speed: 15m wall, always the same route

Three disciplines, one gold medal

In Tokyo 2020, athletes were ranked from 1 to 8 (top to bottom) in each discipline. The final score is the product of the three discipline ranks; the lower the product, the better.

Athlete               Country    Speed  Boulder  Lead  Total  Rank
Janja Garnbret        Slovenia       5        1     1      5     1
Miho Nonaka           Japan          3        3     5     45     2
Akiyo Noguchi         Japan          4        4     4     64     3
Aleksandra Miroslaw   Poland         1        8     8     64     4
…
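The multiplicative scoring can be reproduced in a few lines of R (ranks taken from the table above; note that `ties.method = "first"` happens to reproduce the official ordering of the 64-point tie here, while the actual rule uses head-to-head countback):

```r
# Discipline ranks for the four athletes shown in the table
scores <- data.frame(
  athlete = c("Janja Garnbret", "Miho Nonaka",
              "Akiyo Noguchi", "Aleksandra Miroslaw"),
  speed   = c(5, 3, 4, 1),
  boulder = c(1, 3, 4, 8),
  lead    = c(1, 5, 4, 8)
)

# Tokyo 2020 combined score: the product of the three discipline ranks
scores$total <- scores$speed * scores$boulder * scores$lead

# Lower product = better final rank
scores$rank <- rank(scores$total, ties.method = "first")
scores
```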

Aleksandra Miroslaw finished 4th despite being ranked last in both boulder and lead:

  • solved 0/3 boulder problems (0T0Z, i.e. no tops and no zones)
  • scored 9/40 points in the lead (compared to the others: 13, 20, 21, 29, 34, 35, 37)

But she could have won a medal had she performed better in the qualification round.

Hence, this year (Paris 2024), sport climbing awards two medals:

  • speed, and boulder-and-lead combined
  • the boulder-and-lead combined event has 200 points, with each discipline worth 100 points:
    • boulder: 25 points × 4 problems, with partial credit of 5 and 10 points for zone 1 and zone 2
    • lead: counting from the top, the last 10 moves are worth 4 points each, the previous 10 moves 3 points each, and so on (10 × 4 + 10 × 3 + 10 × 2 + 10 × 1 = 100)
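A quick sanity check of the 200-point arithmetic, sketched in R:

```r
# Boulder: 4 problems worth 25 points each (partial credit not modelled here)
boulder_max <- 4 * 25

# Lead: counting down from the top, 10 holds at 4 pts,
# then 10 at 3, 10 at 2, and 10 at 1
lead_points <- rep(c(4, 3, 2, 1), each = 10)
lead_max <- sum(lead_points)

c(boulder = boulder_max, lead = lead_max, total = boulder_max + lead_max)
```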

The games are on this week, but at 1 am / 4 am PT 😭…

Inspired by tidymodels

A closer look at a class of drought indexes

The pipeline design (9 modules)

data with spatial (\(\mathbf{s}\)) and temporal (\(\mathbf{t}\)) dimensions: \[x_j(s;t)\]

  • Temporal processing: \(f[x_{sj}(t)]\)
  • Spatial processing: \(g[x_{tj}(s)]\)


  • Variable transformation: \(T[x_j(s;t)]\)
  • Scaling: \([x_j(s;t)- \alpha]/\gamma\)
  • Distribution fit: \(F[x_j(s;t)]\)
  • Normalising: \(\Phi^{-1}[x_j(s;t)]\)


  • Dimension reduction: \(h[\mathbf{x}(s;t)]\)
  • Benchmarking: \(u[x(s;t)]\)
  • Simplification
\[\begin{cases} C_0 & c_1 \leq x(\mathbf{s};\mathbf{t}) < c_0 \\ C_1 & c_2 \leq x(\mathbf{s};\mathbf{t}) < c_1 \\ \cdots \\ C_z & x(\mathbf{s};\mathbf{t}) < c_z \end{cases}\]
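The simplification step amounts to binning the continuous index into ordered classes. In base R this is a `cut()` call; the cut points below are illustrative only, loosely following SPI-style drought classes rather than any official definition:

```r
# Hypothetical index values at five locations/time points
x <- c(0.4, -0.7, -1.2, -1.8, -2.3)

# Illustrative cut points c_z < ... < c_1 (not the official classes)
classes <- cut(
  x,
  breaks = c(-Inf, -2, -1.5, -1, Inf),
  labels = c("extreme drought", "severe drought",
             "moderate drought", "no drought")
)
classes
```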

Software design

DATA |>
  module1(...) |>
  module2(...) |>
  module3(...) |>
  ...

dimension_reduction(V1 = aggregate_linear(...))
dimension_reduction(V2 = aggregate_geometrical(...))
dimension_reduction(V3 = aggregate_manual(...))

The aggregate_*() functions can be evaluated as standalone recipes, before being evaluated with the data in the dimension reduction module:

aggregate_manual(~x1 + x2)
[1] "aggregate_manual"
attr(,"formula")
[1] "x1 + x2"
attr(,"class")
[1] "dim_red"
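One way such a standalone recipe could be constructed (a sketch of the idea, not the actual tidyindex internals) is to capture the formula as an attribute and tag the object with a class, deferring evaluation until the data arrive in the dimension reduction module:

```r
# Sketch: build a deferred "recipe" object; the real implementation
# may differ in its details
aggregate_manual_sketch <- function(formula) {
  structure(
    "aggregate_manual",
    formula = deparse(formula[[2]]),  # keep the RHS, e.g. "x1 + x2"
    class   = "dim_red"
  )
}

r <- aggregate_manual_sketch(~ x1 + x2)
attr(r, "formula")
class(r)
```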

Pipeline for two drought indexes

data %>%                         # data contain `prcp`
  aggregate(                     # step 1: temporal aggregation
    .var = prcp,                 #         aggregate `prcp` with time scale
    .scale = .scale) %>%         #         to create `.agg`, by default
  dist_fit(.dist = .dist,        # step 2: distribution fit
           .method = "lmoms",    #         using L-moment to fit `.dist`
           .var = .agg) %>%      #         distribution on `.agg`
  augment(.var = .agg)           # step 3: normalising 
                                 #         find the normal density for `.agg`

data %>%                                  # data contain `tavg` and `prcp`
  var_trans(                              # step 1: variable transformation
    .method = "thornthwaite",             #         using the thornthwaite function
    .vars = tavg,                         #         on `tavg` 
    .new_name = "pet") %>%                #         to create a new variable `pet` 
  dim_red(diff = prcp - pet) %>%          # step 2: dimension reduction 
  aggregate(                              # step 3: temporal aggregation
    .var = diff,                          #         aggregate `diff` with time scale
    .scale = .scale) %>%                  #         `.scale` to create `.agg`
  dist_fit(                               # step 4: distribution fit
    .dist = dist_gev(),                   #         use the gev distribution 
    .var = .agg,                          #         to fit the variable `.agg`
    .method = "lmoms") %>%                #         using L-moment        
  augment(.var = .agg)                    # step 5: normalising 
                                          #         find the normal density for `.agg`

Confidence interval in the SPI

One hundred bootstrap samples are taken from the aggregated precipitation series to estimate the gamma parameters and to calculate the SPI for the Texas Post Office station in Queensland.

DATA %>%
  # aggregate monthly precipitation 
  # with a 24-month window
  aggregate(
    .var = prcp, .scale = 24
    ) %>%
  # fit a gamma distribution to 
  # obtain the probability value
  # [0, 1]
  dist_fit(
    .dist = dist_gamma(), .var = .agg, 
    .n_boot = 100
    ) %>%
  # use the inverse CDF to 
  # convert into z-score
  augment(.var = .agg)
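Once the bootstrap replicates of the index are available, the interval bounds are just per-timepoint quantiles. A minimal sketch with made-up replicates (the distribution here is invented purely for illustration):

```r
set.seed(1)

# 100 hypothetical bootstrap replicates of the index at one time point
boot_index <- rnorm(100, mean = -2.1, sd = 0.3)

# Percentile bounds for the 95% and 80% confidence intervals
q <- quantile(boot_index, probs = c(0.025, 0.10, 0.90, 0.975))
q
```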

Confidence interval in the SPI

80% and 95% confidence intervals of the Standardized Precipitation Index (SPI-24) for the Texas Post Office station in Queensland, Australia. The dashed line at SPI = -2 represents an extreme drought as defined by the SPI. Most parts of the confidence intervals from 2019 to 2020 sit below the extreme drought line and are relatively wide compared to other time periods. This suggests that while it is clear that the Texas Post Office station was suffering a drastic drought, there is considerable uncertainty in quantifying its severity, given the extremity of the event.

Global Gender Gap Index
